Assessment Methods
Scaling Equitable Reflection Assessment in Education via Large Language Models and Role-Based Feedback Agents
Formative feedback is widely recognized as one of the most effective drivers of student learning, yet it remains difficult to implement equitably at scale. In large or low-resource courses, instructors often lack the time, staffing, and bandwidth required to review and respond to every student reflection, creating gaps in support precisely where learners would benefit most. This paper presents a theory-grounded system that uses five coordinated role-based LLM agents (Evaluator, Equity Monitor, Metacognitive Coach, Aggregator, and Reflexion Reviewer) to score learner reflections with a shared rubric and to generate short, bias-aware, learner-facing comments. The agents first produce structured rubric scores, then check for potentially biased or exclusionary language, add metacognitive prompts that invite students to think about their own thinking, and finally compose a concise feedback message of at most 120 words. The system includes simple fairness checks that compare scoring error across lower- and higher-scoring learners, enabling instructors to monitor and bound disparities in accuracy. We evaluate the pipeline in a 12-session AI literacy program with adult learners. In this setting, the system produces rubric scores that approach expert-level agreement, and trained graders rate the AI-generated comments as helpful, empathetic, and well aligned with instructional goals. Taken together, these results suggest that multi-agent LLM systems can deliver equitable, high-quality formative feedback at a scale and speed that human graders alone could not match. More broadly, the work points toward a future where feedback-rich learning becomes feasible for any course size or context, advancing long-standing goals of equity, access, and instructional capacity in education.
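The abstract above outlines both the agent hand-offs and the fairness check; the sketch below makes that flow concrete. Everything here is an illustrative assumption, not the authors' implementation: `call_llm` is a hypothetical stand-in for any chat-completion API, the role prompts are paraphrased from the role names, and the median split is one plausible reading of "lower- and higher-scoring learners".

```python
# Minimal sketch of a five-role feedback pipeline plus a subgroup
# fairness check. All names and prompts are illustrative assumptions.
from statistics import mean

def call_llm(role_prompt: str, text: str) -> str:
    """Hypothetical stand-in for any chat-completion API call."""
    raise NotImplementedError

def review_reflection(reflection: str) -> str:
    scores = call_llm("Evaluator: score this reflection on the shared rubric.", reflection)
    audited = call_llm("Equity Monitor: flag biased or exclusionary wording.", scores)
    coached = call_llm("Metacognitive Coach: add prompts about the learner's own thinking.", audited)
    draft = call_llm("Aggregator: compose learner-facing feedback in at most 120 words.", coached)
    return call_llm("Reflexion Reviewer: revise the draft for tone and rubric fit.", draft)

def fairness_gap(pairs: list[tuple[float, float]]) -> float:
    """pairs: (human_score, ai_score). Compares mean absolute scoring error
    for learners below vs. at-or-above the median human score; a value near
    zero indicates comparable accuracy across the two groups."""
    cut = sorted(h for h, _ in pairs)[len(pairs) // 2]
    low = [abs(h - a) for h, a in pairs if h < cut]
    high = [abs(h - a) for h, a in pairs if h >= cut]
    return mean(low) - mean(high)
```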
- Instructional Material > Course Syllabus & Notes (0.66)
- Research Report > New Finding (0.48)
- Education > Curriculum > Subject-Specific Education (0.94)
- Education > Assessment & Standards > Assessment Methods (0.71)
- Education > Educational Setting > Online (0.68)
- Education > Educational Technology > Educational Software > Computer-Aided Assessment (0.48)
Feedback Indicators: The Alignment between Llama and a Teacher in Language Learning
Rüdian, Sylvio, Elsir, Yassin, Kretschmer, Marvin, Cayrou, Sabine, Pinkwart, Niels
Automated feedback generation has the potential to enhance students' learning progress by providing timely and targeted feedback. Moreover, it can assist teachers in optimizing their time, allowing them to focus on more strategic and personalized aspects of teaching. To generate high-quality, information-rich formative feedback, it is essential first to extract relevant indicators, as these serve as the foundation upon which the feedback is constructed. Teachers often employ feedback criteria grids composed of various indicators that they evaluate systematically. This study examines the initial phase of extracting such indicators from students' submissions in a language learning course using the large language model Llama 3.1. Accordingly, we investigate the alignment between indicators generated by the LLM and human ratings across various feedback criteria. The findings demonstrate statistically significant, strong correlations, even in cases involving unanticipated combinations of indicators and criteria. The methodology employed in this paper offers a promising foundation for extracting indicators from students' submissions using LLMs. Such indicators could be used to auto-generate explainable and transparent formative feedback in future research.
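The core quantity in this study is the correlation between LLM-extracted indicators and human ratings. Below is a minimal sketch of that alignment check with invented data and an invented indicator name; Spearman is used here as a placeholder for whichever correlation measure the authors applied.

```python
# Hedged sketch: correlate one LLM-extracted indicator with teacher
# ratings on the same submissions. Data and indicator name are invented.
from scipy.stats import spearmanr

llm_indicator = [3, 1, 4, 2, 5, 4, 2]   # e.g., Llama 3.1 "coherence" ratings
teacher_score = [3, 2, 4, 2, 5, 3, 1]   # human ratings, same submissions

rho, p = spearmanr(llm_indicator, teacher_score)
print(f"Spearman rho={rho:.2f}, p={p:.3f}")
# A strong, significant rho would indicate alignment for this
# indicator-criterion pair.
```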
- Europe > Germany > Berlin (0.05)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Slovenia > Drava > Municipality of Benedikt > Benedikt (0.04)
- Europe > France > Île-de-France > Hauts-de-Seine > Nanterre (0.04)
- Education > Educational Setting > Online (0.69)
- Education > Assessment & Standards > Assessment Methods (0.69)
- Education > Curriculum > Subject-Specific Education (0.51)
- Education > Educational Technology > Educational Software > Computer Based Training (0.46)
CoTAL: Human-in-the-Loop Prompt Engineering for Generalizable Formative Assessment Scoring
Cohn, Clayton, S, Ashwin T, Mohammed, Naveeduddin, Biswas, Gautam
Large language models (LLMs) have created new opportunities to assist teachers and support student learning. While researchers have explored various prompt engineering approaches in educational contexts, the degree to which these approaches generalize across domains--such as science, computing, and engineering--remains underexplored. In this paper, we introduce Chain-of-Thought Prompting + Active Learning (CoTAL), an LLM-based approach to formative assessment scoring that (1) leverages Evidence-Centered Design (ECD) to align assessments and rubrics with curriculum goals, (2) applies human-in-the-loop prompt engineering to automate response scoring, and (3) incorporates chain-of-thought (CoT) prompting and teacher and student feedback to iteratively refine questions, rubrics, and LLM prompts. Our findings demonstrate that CoTAL improves GPT-4's scoring performance across domains, achieving gains of up to 38.9% over a non-prompt-engineered baseline (i.e., without labeled examples, chain-of-thought prompting, or iterative refinement). Teachers and students judge CoTAL to be effective at scoring and explaining responses, and their feedback produces valuable insights that enhance grading accuracy and explanation quality.
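A hedged sketch of what a CoTAL-style prompt might look like: rubric plus labeled examples (the candidates active learning would select) plus a chain-of-thought instruction. The template wording and function name are assumptions, not the paper's actual prompt.

```python
# Illustrative shape of a CoT scoring prompt with few-shot examples.
# Everything here is an assumption about structure, not CoTAL's exact text.
def build_scoring_prompt(rubric: str,
                         examples: list[tuple[str, int, str]],
                         answer: str) -> str:
    shots = "\n\n".join(
        f"Response: {resp}\nReasoning: {why}\nScore: {score}"
        for resp, score, why in examples
    )
    return (
        f"Rubric:\n{rubric}\n\n{shots}\n\n"
        f"Response: {answer}\n"
        "Think step by step against each rubric criterion, then output "
        "Reasoning: and Score: lines in the same format as above."
    )
```

In the human-in-the-loop cycle the abstract describes, teacher and student feedback on scored responses then drives revisions to the rubric, the examples, and a template like this one before the next round.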
- North America > United States (0.28)
- Europe > Montenegro (0.04)
- Europe > Czechia > Prague (0.04)
- (2 more...)
- Education > Educational Technology > Educational Software > Computer Based Training (1.00)
- Education > Educational Setting > Online (1.00)
- Education > Curriculum > Subject-Specific Education (1.00)
- (4 more...)
Challenges and Research Directions from the Operational Use of a Machine Learning Damage Assessment System via Small Uncrewed Aerial Systems at Hurricanes Debby and Helene
Manzini, Thomas, Perali, Priyankari, Murphy, Robin R., Merrick, David
This paper details four principal challenges encountered with machine learning (ML) damage assessment using small uncrewed aerial systems (sUAS) at Hurricanes Debby and Helene that prevented, degraded, or delayed the delivery of data products during operations, and suggests three research directions for future real-world deployments. The presence of these challenges is not surprising: a review of the literature covering both datasets and proposed ML models suggests this is the first sUAS-based ML system for disaster damage assessment actually deployed as part of real-world operations. The sUAS-based ML system was applied by the State of Florida to Hurricanes Helene (2 orthomosaics, 3.0 gigapixels collected over 2 sorties by a Wingtra WingtraOne sUAS) and Debby (1 orthomosaic, 0.59 gigapixels collected via 1 sortie by a Wingtra WingtraOne sUAS) in Florida. The same model was applied to crewed aerial imagery of inland flood damage resulting from post-tropical remnants of Hurricane Debby in Pennsylvania (436 orthophotos, 136.5 gigapixels), providing further insights into the advantages and limitations of sUAS for disaster response. The four challenges (variation in the spatial resolution of input imagery, spatial misalignment between imagery and geospatial data, wireless connectivity, and data product format) lead to three recommendations specifying the research needed to improve ML model capabilities to accommodate the wide variation of spatial resolutions used in practice, handle spatial misalignment, and minimize dependence on wireless connectivity. These recommendations are expected to improve the effective operational use of sUAS and sUAS-based ML damage assessment systems for disaster response.
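The first challenge, variation in the spatial resolution of input imagery, has a natural preprocessing response: resample tiles to a common ground sample distance (GSD) before inference. The sketch below is a generic illustration under that assumption; the 5 cm target and function name are invented, and bilinear zoom is just one plausible resampling choice, not the system's method.

```python
# Generic sketch of resolution normalization; values are illustrative.
import numpy as np
from scipy.ndimage import zoom

def to_target_gsd(tile: np.ndarray, gsd_cm: float,
                  target_gsd_cm: float = 5.0) -> np.ndarray:
    """Rescale an (H, W, bands) tile so one pixel covers target_gsd_cm."""
    factor = gsd_cm / target_gsd_cm  # <1 coarsens fine imagery, >1 enlarges coarse
    return zoom(tile, (factor, factor, 1), order=1)  # bilinear, bands untouched
```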
- North America > United States > Texas > Brazos County > College Station (0.14)
- Africa > Eswatini > Manzini > Manzini (0.05)
- North America > United States > Pennsylvania > Tioga County (0.04)
- (3 more...)
- Aerospace & Defense (1.00)
- Government > Regional Government > North America Government > United States Government (0.66)
- Education > Assessment & Standards > Assessment Methods (0.63)
Towards an intelligent assessment system for evaluating the development of algorithmic thinking skills: An exploratory study in Swiss compulsory schools
The rapid digitalisation of contemporary society has profoundly impacted various facets of our lives, including healthcare, communication, business, and education. The ability to engage with new technologies and solve problems has become crucial, making computational thinking (CT) skills, such as pattern recognition, decomposition, and algorithm design, essential competencies. In response, Switzerland is conducting research and initiatives to integrate CT into its educational system. This study aims to develop a comprehensive framework for large-scale assessment of CT skills, particularly focusing on algorithmic thinking (AT), the ability to design algorithms. To achieve this, we first developed a competence model capturing the situated and developmental nature of CT, guiding the design of activities tailored to cognitive abilities, age, and context. This framework clarifies how activity characteristics influence CT development and how these competencies can be assessed. Additionally, we developed an activity for large-scale assessment of AT skills, offered in two variants: one based on non-digital artefacts (unplugged) with manual expert assessment, and the other based on digital artefacts (virtual) with automatic assessment. To provide a more comprehensive evaluation of students' competencies, we developed an intelligent assessment system (IAS) based on Bayesian networks (BNs) with noisy gates, which offers real-time probabilistic assessment for each skill rather than a single overall score. The results indicate that the proposed instrument can measure AT competencies across different age groups and educational contexts in Switzerland, demonstrating its applicability for large-scale use. AT competencies exhibit progressive development, with no overall gender differences, though variations are observed at the school level, significantly influenced by the artefact-based environment and its context, underscoring the importance of creating accessible and adaptable assessment tools.
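The IAS mentioned above relies on Bayesian networks with noisy gates. A worked noisy-OR example shows the kind of computation such a gate performs: the probability of success on an item given which prerequisite skills are mastered, with a leak term covering success despite no mastered skill. The skill names and numbers are illustrative, not the study's parameters.

```python
# Worked noisy-OR gate; values are illustrative only.
def noisy_or(leak: float, strengths: list[float], mastered: list[bool]) -> float:
    """P(success) = 1 - (1 - leak) * prod over mastered skills i of (1 - s_i)."""
    p_fail = 1.0 - leak
    for s, m in zip(strengths, mastered):
        if m:
            p_fail *= 1.0 - s
    return 1.0 - p_fail

# Skills: decomposition (strength 0.7), algorithm design (0.8); leak 0.1.
print(noisy_or(0.1, [0.7, 0.8], [True, False]))  # -> 0.73
```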
- Europe > Ireland (0.14)
- North America > United States > California > San Francisco County > San Francisco (0.13)
- Europe > Austria > Vienna (0.13)
- (46 more...)
- Workflow (1.00)
- Summary/Review (1.00)
- Research Report > New Finding (1.00)
- (3 more...)
R.U.Psycho? Robust Unified Psychometric Testing of Language Models
Schelb, Julian, Borin, Orr, Garcia, David, Spitz, Andreas
Generative language models are increasingly being subjected to psychometric questionnaires intended for human testing, in efforts to establish their traits, as benchmarks for alignment, or to simulate participants in social science experiments. While this growing body of work sheds light on the likeness of model responses to those of humans, concerns are warranted regarding the rigour and reproducibility with which these experiments may be conducted. Instability in model outputs, sensitivity to prompt design and parameter settings, and the large number of available model versions all increase documentation requirements. Consequently, generalizing findings is often complex and reproducibility is far from guaranteed. In this paper, we present R.U.Psycho, a framework for designing and running robust and reproducible psychometric experiments on generative language models that requires limited coding expertise. We demonstrate the capability of our framework on a variety of psychometric questionnaires, which lend support to prior findings in the literature. R.U.Psycho is available as a Python package at https://github.com/julianschelb/rupsycho.
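The reproducibility problem the paper targets can be pictured as a settings grid: the same item administered across temperatures and repetitions, with every setting logged. The sketch below is a generic illustration of that idea, not the rupsycho package API (the repository documents the real interface); `ask_model` is a hypothetical stand-in for an LLM call.

```python
# Generic settings-grid sketch; NOT the rupsycho API.
import itertools
import json

def ask_model(item: str, temperature: float, seed: int) -> str:
    raise NotImplementedError  # hypothetical LLM call

ITEM = "I see myself as someone who is talkative. (1-5)"
runs = []
for temp, seed in itertools.product([0.0, 0.7], range(5)):
    runs.append({"item": ITEM, "temperature": temp, "seed": seed,
                 "answer": ask_model(ITEM, temp, seed)})
print(json.dumps(runs[:1], indent=2))  # full settings logged per response
```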
- Questionnaire & Opinion Survey (1.00)
- Research Report > New Finding (0.93)
- Transportation > Air (0.93)
- Education > Assessment & Standards > Assessment Methods (0.40)
eRevise+RF: A Writing Evaluation System for Assessing Student Essay Revisions and Providing Formative Feedback
Liu, Zhexiong, Litman, Diane, Wang, Elaine, Li, Tianwen, Gobat, Mason, Matsumura, Lindsay Clare, Correnti, Richard
The ability to revise essays in response to feedback is important for students' writing success. An automated writing evaluation (AWE) system that supports students in revising their essays is thus essential. We present eRevise+RF, an enhanced AWE system for assessing student essay revisions (e.g., changes made to an essay to improve its quality in response to essay feedback) and providing revision feedback. We deployed the system with 6 teachers and 406 students across 3 schools in Pennsylvania and Louisiana. The results confirmed its effectiveness in (1) assessing student essays in terms of evidence usage, (2) extracting evidence and reasoning revisions across essays, and (3) determining revision success in responding to feedback. The evaluation also suggested eRevise+RF is a helpful system for young students to improve their argumentative writing skills through revision and formative feedback.
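At the heart of the system is revision extraction: identifying what changed between drafts. As a rough intuition only (eRevise+RF uses trained NLP models to classify evidence and reasoning revisions, not a plain diff), sentence alignment already surfaces candidate revisions:

```python
# Toy revision extraction via sentence alignment; illustration only.
import difflib

def revisions(draft1: str, draft2: str):
    s1 = [s.strip() for s in draft1.split(".") if s.strip()]
    s2 = [s.strip() for s in draft2.split(".") if s.strip()]
    ops = difflib.SequenceMatcher(a=s1, b=s2).get_opcodes()
    return [(tag, s2[j1:j2]) for tag, i1, i2, j1, j2 in ops
            if tag in ("insert", "replace")]  # new or changed sentences

print(revisions("The author shows this. Dogs bark.",
                "The author shows this with the flood example. Dogs bark."))
```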
- North America > United States > Louisiana (0.24)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Africa > Kenya (0.04)
- (13 more...)
- Education > Curriculum > Subject-Specific Education (1.00)
- Education > Assessment & Standards > Assessment Methods (0.84)
- Education > Educational Setting > K-12 Education (0.66)
An Automated Explainable Educational Assessment System Built on LLMs
Li, Jiazheng, Bobrov, Artem, West, David, Aloisi, Cesare, He, Yulan
In this demo, we present AERA Chat, an automated and explainable educational assessment system designed for interactive and visual evaluations of student responses. This system leverages large language models (LLMs) to generate automated marking and rationale explanations, addressing the challenge of limited explainability in automated educational assessment and the high costs associated with annotation. Our system allows users to input questions and student answers, providing educators and researchers with insights into assessment accuracy and the quality of LLM-assessed rationales. Additionally, it offers advanced visualization and robust evaluation tools, enhancing the usability for educational assessment and facilitating efficient rationale verification. Our demo video can be found at https://youtu.be/qUSjz-sxlBc.
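The demo's core pattern is pairing each automated mark with a machine-readable rationale that educators can verify. A minimal sketch of that pattern, with an assumed JSON schema and prompt wording (the paper specifies neither):

```python
# Assumed score-plus-rationale schema; not AERA Chat's actual format.
import json

PROMPT = (
    "Question: {q}\nStudent answer: {a}\n"
    'Reply as JSON: {{"score": <0-3>, "rationale": "<why>"}}'
)

def parse_marking(llm_reply: str) -> tuple[int, str]:
    data = json.loads(llm_reply)
    return int(data["score"]), data["rationale"]  # surfaced for verification
```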
- Education > Assessment & Standards > Student Performance (0.97)
- Education > Assessment & Standards > Assessment Methods (0.61)
NLP Cluster Analysis of Common Core State Standards and NAEP Item Specifications
Camilli, Gregory, Suter, Larry
Camilli (2024) proposed a methodology using natural language processing (NLP) to map the relationship of a set of content standards to item specifications, providing evidence that NLP can improve the mapping process. As part of this investigation, the nominal classifications of standards and item specifications were used to examine construct equivalence. In the current paper, we determine the strength of empirical support for the semantic distinctiveness of these classifications, known as "domains" for Common Core standards and "strands" for National Assessment of Educational Progress (NAEP) item specifications. This is accomplished by separately k-means clustering the embedding vectors of standards and of specifications. We then briefly illustrate an application of these findings.
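A compact sketch of the clustering step as described: k-means over embedding vectors, with silhouette score as one common gauge of how semantically distinct the nominal domains or strands are. The data, the choice of k, and the silhouette metric are illustrative assumptions.

```python
# K-means over embeddings with a separation gauge; data is invented.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

emb = np.random.rand(60, 384)          # stand-in for text embedding vectors
k = 5                                  # e.g., number of nominal domains
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(emb)
print(silhouette_score(emb, labels))   # higher => better-separated clusters
```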
- North America > United States > District of Columbia > Washington (0.05)
- North America > United States > New Jersey > Bergen County > Mahwah (0.04)
- North America > United States > Michigan (0.04)
- (2 more...)
Introducing Flexible Monotone Multiple Choice Item Response Theory Models and Bit Scales
Wallmark, Joakim, Josefsson, Maria, Wiberg, Marie
Item Response Theory (IRT) is a powerful statistical approach for evaluating test items and determining test taker abilities through response analysis. An IRT model that better fits the data leads to more accurate latent trait estimates. In this study, we present a new model for multiple choice data, the monotone multiple choice (MMC) model, which we fit using autoencoders. Using both simulated scenarios and real data from the Swedish Scholastic Aptitude Test, we demonstrate empirically that the MMC model outperforms the traditional nominal response IRT model in terms of fit. Furthermore, we illustrate how the latent trait scale from any fitted IRT model can be transformed into a ratio scale, aiding in score interpretation and making it easier to compare different types of IRT models. We refer to these new scales as bit scales. Bit scales are especially useful for models for which minimal or no assumptions are made for the latent trait scale distributions, such as for the autoencoder fitted models in this study.
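For context on the baseline the MMC model is compared against, Bock's nominal response model gives category probabilities as a softmax over linear functions of the latent trait. A small sketch with illustrative, not estimated, parameters:

```python
# Nominal response model category probabilities; parameters illustrative.
import numpy as np

def nrm_probs(theta: float, a: np.ndarray, c: np.ndarray) -> np.ndarray:
    """P(Y=k | theta) = exp(a_k*theta + c_k) / sum_j exp(a_j*theta + c_j)."""
    z = a * theta + c
    e = np.exp(z - z.max())  # numerically stable softmax
    return e / e.sum()

print(nrm_probs(1.0, np.array([0.0, 0.8, 1.5]), np.array([0.0, -0.2, -1.0])))
```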
- North America > United States > Tennessee > Knox County > Knoxville (0.04)
- North America > United States > Massachusetts > Middlesex County > Reading (0.04)
- North America > United States > Florida > Palm Beach County > Boca Raton (0.04)
- (2 more...)